DECAF-FSEFI: A Fine-grained, Accountable, Flexible, and Efficient Soft Error Fault Injection Framework for Profiling Application Vulnerability
Authors
Abstract
Resilient computation has been an emerging topic in high-performance computing (HPC) for several years. In particular, studies show that tolerating faults on leadership-class supercomputers (such as exascale supercomputers) is expected to be one of the main challenges. In this paper, we use dynamic binary instrumentation and virtual-machine-based fault injection to emulate soft errors and study their impact on application behavior. We propose DECAF-FSEFI, a fine-grained, accountable, flexible, and efficient soft error fault injection framework built on top of QEMU. DECAF-FSEFI provides just-in-time fault injection, fault propagation tracing, and flexible fault injection interfaces. In a case study, we demonstrate the use of DECAF-FSEFI in fault injection experiments. Despite these features, the experiments show that DECAF-FSEFI introduces only 2.48x performance overhead in the worst case and almost no overhead in the best case.
Keywords: soft error; MPI; fault injection; resilience; vulnerability; high-performance computing.
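As a rough illustration of what emulating a soft error can mean at the data level, the sketch below flips a single bit in an integer or in the IEEE 754 encoding of a double. It is not DECAF-FSEFI's implementation or interface: the function names `flip_bit_int` and `flip_bit_float64` are hypothetical, and the single-bit-flip fault model is an assumption chosen for the example.

```python
import struct


def flip_bit_int(value: int, bit: int, width: int = 64) -> int:
    """Flip one bit of an unsigned integer of the given width (hypothetical helper)."""
    if not 0 <= bit < width:
        raise ValueError("bit index out of range")
    mask = (1 << width) - 1
    # XOR with a one-hot mask toggles exactly the selected bit.
    return (value ^ (1 << bit)) & mask


def flip_bit_float64(value: float, bit: int) -> float:
    """Flip one bit in the IEEE 754 binary64 encoding of a float (hypothetical helper)."""
    # Reinterpret the double as a 64-bit unsigned integer, flip, and convert back.
    (raw,) = struct.unpack("<Q", struct.pack("<d", value))
    (corrupted,) = struct.unpack("<d", struct.pack("<Q", flip_bit_int(raw, bit)))
    return corrupted


if __name__ == "__main__":
    # Bit 63 is the sign bit; bit 52 is the lowest exponent bit.
    print(flip_bit_float64(1.0, 63))
    print(flip_bit_float64(1.0, 52))
```

The same flip can have very different consequences depending on which bit is hit (sign, exponent, or mantissa), which is one reason per-application vulnerability profiling is of interest.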
Similar resources
Characterizing the Use of Program Vulnerability Factors for Studying Transient Fault Tolerance in Multi-core Architectures
Semiconductor transient faults (soft errors) are a critical design concern in the reliability of computer systems. Most recent architecture research is focused on using performance models to provide Architecture Vulnerability Factor (AVF) estimates of processor reliability rather than deploying detailed fault injection into hardware RTL models. While AVF analysis provides support for investigat...
Evaluating Application Vulnerability to Soft Errors in Multi-level Cache Hierarchy
As the capacity of caches increases dramatically with new processors, soft errors originating in cache memories have become a major reliability concern for high performance processors. This paper presents application-specific soft error vulnerability analysis in order to understand an application’s responses to soft errors from different levels of caches. Based on a high-performance processor si...
Fast Fault Injection to Evaluate Multicore Systems Soft Error Reliability
The increasing complexity of processors allied to the continuous technology shrink is making multicore-based systems more susceptible to soft errors. The high cost and time inherent to hardware-based fault injection approaches make the more efficient simulation-based fault injection frameworks crucial to test reliability. This paper proposes a fast, flexible fault injector framework which suppo...
A Cost-Effective Selective TMR for Coarse-Grained Reconfigurable Architectures Based on DFG-Level Vulnerability Analysis
This paper proposes a novel method to determine a priority for applying selective triple modular redundancy (selective TMR) against single event upset (SEU) to achieve cost-effective reliable implementation of application circuits onto coarse-grained reconfigurable architectures (CGRAs). The priority is determined by an estimation of the vulnerability of each node in the data flow graph (DFG) o...
Control Flow Checking or Not? (for Soft Errors)
Control Flow Checking (CFC) techniques were proposed to provide efficient protection from soft errors. The main idea is that most soft errors will eventually manifest as errors in the sequence of instruction execution. Therefore, just by making sure that the sequence of instructions executed (or the control flow of the program) is correct, significant protection can be achieved. Note that ...
Publication date: 2017